The American Journal of Human Genetics — Latest Matching Preprints

1

Higher eQTL power reveals signals that boost GWAS colocalization

Rosen, J. D.; Broadaway, K. A.; Brotman, S. M.; Mohlke, K. L.; Love, M. I.

2025-08-05 genetics 10.1101/2025.08.05.668745 medRxiv

Top 0.1%

61.1%

Expression quantitative trait locus (eQTL) studies in human cohorts typically detect at least one regulatory signal per gene, and have been proposed as a way to explain mechanisms of genetic liability for other traits, as discovered in genome-wide association studies (GWAS). In particular, eQTL signals may colocalize with GWAS signals, suggesting gene expression as a possible mediator. However, recent studies have noted colocalization occurs infrequently, even when expression is measured in biologically relevant tissues. Most eQTL studies to date include only hundreds of individuals, and are underpowered to discover distal regulatory signals explaining smaller fractions of gene expression variance. We integrate evidence from recent eQTL studies and demonstrate that limited statistical power due to sample size skews the detection of eQTL signals identified at various signal strengths. We estimate that a sample size of 500 detects <0.1 to 60% of eQTL for a range of signal strengths and that a sample size of 2,000 would detect 36.8% of all eQTL. We show that eQTL signals that can only be discovered in larger studies exhibit characteristics more similar to those of GWAS signals, including greater distance to the regulated gene and higher probability of loss intolerance. Finally, using results from recent eQTL studies and meta-analyses, we observe a large increase in detected colocalizations with GWAS signals compared to previous studies. These findings caution against overinterpreting the absence of colocalization in underpowered studies and provide guidance for designing future eQTL experiments, to improve power and complement perturbation-based approaches in characterizing gene-trait mechanisms.
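
To make the sample-size argument concrete, here is a minimal sketch of an approximate power calculation for a single-variant association test. It is not the authors' procedure. It assumes a simple linear model in which the variant explains a fraction `r2` of expression variance, a two-sided 1-degree-of-freedom test with non-centrality `n * r2 / (1 - r2)`, and an illustrative significance threshold. The function name `approx_power`, the default `alpha`, and the example `r2` values are all hypothetical.

```python
from statistics import NormalDist


def approx_power(n, r2, alpha=5e-8):
    """Approximate power of a two-sided 1-df association test.

    Assumes the test statistic is roughly N(sqrt(ncp), 1) with
    ncp = n * r2 / (1 - r2). This is a textbook approximation,
    not the procedure used in the preprint.
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = (n * r2 / (1.0 - r2)) ** 0.5
    # Probability that |Z + shift| exceeds the critical value.
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical signal strengths (fraction of expression variance explained),
# evaluated at the two sample sizes discussed in the abstract.
for r2 in (0.005, 0.02, 0.05):
    print(r2, round(approx_power(500, r2), 3), round(approx_power(2000, r2), 3))
```

Under these assumptions, power for a fixed `r2` rises steeply with sample size, which is the qualitative pattern the abstract describes for weaker signals.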

2

Leveraging Clinical, Functional, Molecular and Population Genetic Data Reveals Genotype Phenotype Association and Health Disparity in a Monogenic Disorder, CTX

Hanson, J.; Bonnen, P. E.

2024-04-16 genetic and genomic medicine 10.1101/2024.04.15.24305853 medRxiv

Top 0.1%

59.2%

Cerebrotendinous Xanthomatosis (CTX) is a lipid storage disease caused by recessively inherited pathogenic variants in CYP27A1 (OMIM 213700). The classic clinical presentation includes infantile-onset chronic diarrhea, juvenile-onset bilateral cataracts, with development of tendon xanthomas and progressive neurological dysfunction. These multisystem clinical features typically appear in different decades of life, often confounding diagnosis of CTX. Further complicating diagnosis is the generally held belief that the clinical presentation of CTX varies highly between individuals and even within families. CTX is a treatable disorder and treatment is most effective when started in the first two decades of life, rendering a particular urgency to diagnosis. In this study we bring a novel approach to detecting genotype-phenotype associations in CTX. We conducted a systematic review of the literature to identify all functional analyses of pathogenic CYP27A1 variants at the level of mRNA, protein and enzyme activity. We identified missense variants that result in complete loss of function (LOF) as well as missense variants that retain some partial function (hypomorphs). Next, we identified every CTX patient in the medical literature whose genotype and clinical phenotype were reported, and binned them according to functional genotype: LOF vs hypomorph. Analysis of these clinical, biochemical and molecular genetics data revealed a clear genotype-phenotype association for CTX based on individuals who had two LOF variants vs two hypomorphs. The prevalence of each clinical feature was significantly higher in individuals with two LOF variants for every feature except tendon xanthoma and pyramidal signs. CTX had a detrimental effect on cognition for almost everyone with two LOF variants (96%), while tendon xanthomas were the most common feature in individuals with two hypomorphs (88%). We suspect this is due to ascertainment bias; individuals with a milder form of CTX may not get diagnosed with CTX unless they have this unusual hallmark of the disease. We studied the population genetics of the pathogenic CYP27A1 alleles in gnomAD (N~800,000). Estimated disease incidence based on carrier frequencies was consistent across the African/African American, Admixed American and European populations (1/308,000). However, no African/African American individuals have been reported in the medical literature as having CTX. Analyses of the pathogenic alleles in each population showed that the frequency of hypomorph pathogenic CYP27A1 alleles was higher in African/African Americans (p=3.6E-4) than in Europeans (p=1.2E-4). Conversely, LOF alleles had a lower frequency in African/African Americans than in Europeans, p=6.1E-4 vs p=8.6E-4, respectively. By combining clinical, molecular, functional and population genetics we uncovered a large health disparity in the diagnosis and treatment of CTX in African Americans and point to the milder clinical presentation of hypomorphs as an underlying component. The results of this study reveal specific opportunities for mitigating this disparity through recognition of the milder form of CTX as a clinical entity that is driven by hypomorph genetic alleles and broad adoption of biochemical testing that utilizes more sensitive biomarkers. Applying the framework and concepts leveraged in this study to the diagnosis of all monogenic disorders will likely result in improved diagnosis and health equity for the rare disease community.

Key findings:
- Joint analysis of clinical, functional, molecular, and population genetic data reveals health disparity in African Americans in a rare monogenic disorder, CTX.
- The gene that causes CTX, CYP27A1, harbors pathogenic missense variants that are loss of function and other pathogenic missense variants that are hypomorphs.
- Genotype-phenotype analyses based on functional genotype - loss of function vs hypomorph - revealed a phenotype x functional genotype association for CTX.
- Individuals with a loss of function genotype have a significantly more severe clinical presentation than those with a hypomorph genotype.
- Nearly all individuals with CTX who have a loss of function genotype have detrimental effects on their cognition (96%). The only exceptions received treatment with CDCA in the first decade of life.
- Population genetic analyses estimate that incidence of CTX is consistent across Black and White populations, but systematic review of the medical literature returned no Black individuals reported to have CTX.
- Hypomorph pathogenic variants in CYP27A1 occur more frequently in African/African Americans (p=3.6E-4) than Europeans (p=1.2E-4). The milder clinical presentation of the hypomorph genotype likely contributes to the under-diagnosis and misdiagnosis of African/African Americans with CTX.

3

An updated map of GRCh38 linkage disequilibrium blocks based on European ancestry data

MacDonald, J.; Harrison, T.; Bammler, T.; Mancuso, N.; Lindstroem, S.

2022-03-07 genetics 10.1101/2022.03.04.483057 medRxiv

Top 0.1%

57.9%

A map of approximately independent linkage disequilibrium (LD) blocks has many uses in statistical genetics. Current publicly available LD block maps are based on sparse recombination maps and are only available for GRCh37 (hg19) and prior genome assemblies. We generated LD blocks in GRCh38 coordinates for African (AFR), East Asian (EAS), European (EUR) and South Asian (SAS) ancestry populations. These new maps consist of 1,143 (EAS) - 1,604 (AFR) independent LD blocks across the 22 autosomal chromosomes and can be accessed at https://github.com/jmacdon/LDblocks_GRCh38.

4

Pitfalls in estimating and interpreting the contribution of ultra-rare genetic variants to the heritability of complex traits

Wang, H.; Wainschtein, P.; Sidorenko, J.; Fikere, M.; Zhang, Y.; Kemper, K. E.; Zheng, Z.; Hivert, V.; Zeng, J.; Goddard, M. E.; Visscher, P. M.; Yengo, L.

2026-04-07 genetic and genomic medicine 10.64898/2026.04.06.26350278 medRxiv

Top 0.1%

53.5%

Assessing the contribution of ultra-rare variants (minor allele frequency <0.01%) to the heritability of complex traits remains challenging due to limited understanding of potential biases. Here, we focus on singletons (that is, variants observed only once in the study sample), the most abundant class of ultra-rare variants, to showcase various confounders of heritability estimates and underline pitfalls in their interpretation. We show through theory, simulations, and analysis of 5,330,210 exome-sequenced singletons in 305,813 unrelated European-ancestry individuals in the UK Biobank that (i) population stratification induces both upward and downward biases in singleton-based heritability estimates, (ii) estimates capture non-additive genetic effects, and (iii) asymptotic standard errors of estimates from likelihood-based procedures are generally mis-calibrated when traits are not normally distributed. We further showcase these biases in real-data analyses of 22 quantitative phenotypes and report, after accounting for these pitfalls, significant estimates for number of children (3.4%), peak expiratory flow (1.9%), red blood cell count (2.5%), white blood cell count (1.9%) and heel bone mineral density (2.4%). Overall, our study provides recommendations for robust inference of heritability from ultra-rare variants and underscores that reliable estimates for ordinal and binary traits will require far larger sample sizes and improved methods, given that confounding in these traits remains difficult to detect and correct.

5

Robust Mixed Model Association Test for Gene-Environment Interactions

Zhang, M.; Tang, J.; Brown, M. R.; Morrison, A. C.; Boerwinkle, E.; Manning, A. K.; Liu, C.-T.; Chen, H.

2025-10-03 genetic and genomic medicine 10.1101/2025.10.01.25336808 medRxiv

Top 0.1%

53.0%

Linear mixed models (LMMs) are widely used in gene-environment interaction (GEI) studies to account for population structure and relatedness. However, genome-wide GEI tests using LMMs are computationally intensive, and model-based tests can yield inflated type I error rates when environmental main effects are misspecified. While robust inference methods exist for unrelated samples, challenges remain for related individuals. A common workaround is a two-step approach that first adjusts for relatedness via an LMM and then uses residuals in a standard linear model, but its validity for GEI studies is unclear. We propose a robust mixed model association test (RoM) for large-scale GEI analysis in related samples. RoM uses the Huber-White sandwich estimator and offers efficient computation, scaling linearly with sample size when cluster sizes are bounded. Simulations show that RoM achieves better type I error control at genome-wide significance levels than both the two-step method and alternative strategies. We apply RoM to GEI analyses of waist-hip ratio (WHR) with BMI using data from the Framingham Heart Study (7,264 related individuals), ARIC (9,312 individuals with repeated measures), and WHR with sex using data from UK Biobank (407,068 related individuals), confirming robust error control and comparable signal detection.
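
The Huber-White sandwich idea mentioned above can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions and is not the RoM implementation: it uses a one-predictor linear model through the origin rather than a mixed model, and the function name `cluster_robust_slope` and its arguments are hypothetical. It shows the "bread-meat-bread" structure, where score contributions are summed within each cluster (for example, a family) before being squared.

```python
from collections import defaultdict


def cluster_robust_slope(x, y, cluster_ids):
    """OLS slope through the origin with a cluster-robust sandwich variance.

    Illustrative only: a one-predictor toy model, not a mixed model.
    Returns (slope estimate, sandwich variance of the slope).
    """
    sxx = sum(xi * xi for xi in x)  # "bread": X'X for a single predictor
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    # "Meat": sum score contributions x_i * residual_i within each cluster,
    # then square the per-cluster totals.
    scores = defaultdict(float)
    for xi, yi, cid in zip(x, y, cluster_ids):
        scores[cid] += xi * (yi - beta * xi)
    meat = sum(s * s for s in scores.values())
    return beta, meat / (sxx * sxx)


# Hypothetical data: each observation in its own cluster reduces this to
# the classic heteroskedasticity-robust (HC0) variance.
beta, variance = cluster_robust_slope([1.0, 2.0], [1.0, 3.0], [0, 1])
print(beta, variance)
```

Because the variance is built from empirical residuals rather than from the assumed model covariance, it remains valid when parts of the model are misspecified, which is the property the abstract relies on.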

6

Substantial role of rare inherited variation in individuals with developmental disorders

Samocha, K. E.; Chundru, V. K.; Fu, J. M.; Gardner, E. J.; Danecek, P.; Wigdor, E. M.; Malawsky, D. S.; Lindsay, S. J.; Campbell, P.; Singh, T.; Eberhardt, R. Y.; Gallone, G.; Wright, C. F.; Martin, H. C.; Firth, H. V.; Hurles, M. E.

2024-08-29 genetic and genomic medicine 10.1101/2024.08.28.24312746 medRxiv

Top 0.1%

52.7%

While the role of de novo and recessively-inherited coding variation in risk for rare developmental disorders (DDs) has been well established, the contribution of damaging variation dominantly-inherited from parents is less explored. Here, we investigated the contribution of rare coding variants to DDs by analyzing 13,452 individuals with DDs, 18,613 of their family members, and 3,943 controls using a combination of family-based and case/control analyses. In line with previous studies of other neuropsychiatric traits, we found a significant burden of rare (allele frequency < 1 × 10⁻⁵) predicted loss-of-function (pLoF) and damaging missense variants, the vast majority of which are inherited from apparently unaffected parents. These predominantly inherited burdens are strongest in DD-associated genes or those intolerant of pLoF variation in the general population; however, we estimate that ~10% of the excess of these variants in DD cases is found within the DD-associated genes, implying many more risk loci are yet to be identified. We found similar, but attenuated, burdens when comparing the unaffected parents of individuals with DDs to controls, indicating that parents have elevated risk of DDs due to these rare variants, which are overtransmitted to their affected children. We estimate that 6-8.5% of the population attributable risk for DDs is due to rare pLoF variants in those genes intolerant of pLoF variation in the general population. Finally, we apply a Bayesian framework to combine evidence from these analyses of rare, mostly-inherited variants with prior de novo mutation burden analyses to highlight an additional 25 candidate DD-associated genes for further follow-up.

7

Per-allele disease and complex trait effect sizes are predominantly African MAF-dependent in European populations

Rossen, J.; Strober, B. J.; Hou, K.; Kerner, G.; Price, A. L.

2026-01-02 genetic and genomic medicine 10.64898/2025.12.31.25343290 medRxiv

Top 0.1%

52.0%

Understanding genetic architectures of disease is fundamental to partitioning heritability, polygenic risk prediction, and statistical fine-mapping. Genetic architectures of disease in European populations have been shown to depend on European minor allele frequency (MAF): SNPs with lower MAF have larger per-allele effects, due to the action of negative selection. However, we hypothesized that African MAF (defined using African-ancestry segments in African Americans), which is not distorted by the out-of-Africa bottleneck, might better predict per-allele effect sizes of common genetic variation in European populations; we note that common variants explaining most disease heritability are typically much older than the split between African and non-African populations. To demonstrate this, we first analyze the proportion of non-synonymous SNPs, which are strongly impacted by negative selection. The proportion of non-synonymous SNPs is much better predicted by African MAF than European MAF; a mixture of African MAF with weight w=0.95 (95% CI: (0.93, 0.96)) and European MAF with weight (1-w) is a more powerful predictor than either European MAF (P<10⁻¹⁵, 3.65x greater increase in log-likelihood relative to a null model without MAF dependence) or African MAF (P<10⁻¹⁵). Next, we consider the widely used $\alpha$ model, in which per-allele GWAS effect size variance is proportional to $[p_E(1 - p_E)]^{\alpha}$, where $p_E$ is the European MAF. We propose a different model in which per-allele effect size variance is proportional to $[p_{mix}(1 - p_{mix})]^{\alpha_{mix}}$, where $p_{mix} = w\,p_A + (1-w)\,p_E$, and $p_A$ is the African MAF. We fit the $\alpha_{mix}$ model by extending the baseline-LD model used in S-LDSC to include a grid of bivariate African and European MAF bins and identifying values of $w$ and $\alpha_{mix}$ that best fit mean effect size variance estimates from S-LDSC across bivariate MAF bins. We demonstrate that our approach provides conservative estimates of $w$ in simulations. We applied this approach to summary statistics for 50 diseases/complex traits in European populations (average N=483K) and estimated best-fit parameters of $w$=0.96 (95% CI: (0.76, 1.16)) and $\alpha_{mix}$=-0.34 (95% CI: (-0.67, -0.02)), attaining a far better fit than the standard model using $p_E$ only (P<10⁻¹⁵, 4.53x greater decrease in mean-squared error relative to a null model without MAF dependence). We conclude that per-allele disease and complex trait effect sizes are predominantly African MAF-dependent in European populations.
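
The mixture model described in this abstract can be sketched numerically. The sketch below evaluates p_mix = w*pA + (1-w)*pE and a relative effect-size variance of the form [p(1 - p)] raised to an exponent. The exponent name `alpha`, the function names, and the example allele frequencies are assumptions made for illustration; the default values 0.96 and -0.34 are the best-fit estimates reported in the abstract. The result is defined only up to a proportionality constant and is not a reproduction of the S-LDSC fitting procedure.

```python
def mixed_maf(p_african, p_european, w=0.96):
    """p_mix = w * pA + (1 - w) * pE, as defined in the abstract."""
    return w * p_african + (1.0 - w) * p_european


def relative_effect_variance(p, alpha=-0.34):
    """Per-allele effect-size variance up to a constant: [p (1 - p)] ** alpha."""
    if not 0.0 < p < 1.0:
        raise ValueError("allele frequency must be strictly between 0 and 1")
    return (p * (1.0 - p)) ** alpha


# Hypothetical SNP: rare on African-ancestry segments, common in Europeans.
p_a, p_e = 0.05, 0.30
p_mix = mixed_maf(p_a, p_e)
print("p_mix:", p_mix)
print("European-MAF-only model:", relative_effect_variance(p_e))
print("mixture model:", relative_effect_variance(p_mix))
```

With a negative exponent, a SNP that is rare in terms of African MAF is assigned a larger expected per-allele effect variance under the mixture model than under the European-MAF-only model, even when it is common in Europeans.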

8

Low generalizability of polygenic scores in African populations due to genetic and environmental diversity

Majara, L.; Kalungi, A.; Koen, N.; Zar, H.; Stein, D. J.; Kinyanda, E.; Atkinson, E. G.; Martin, A. R.

2021-01-14 genetics 10.1101/2021.01.12.426453 medRxiv

Top 0.1%

51.4%

African populations are vastly underrepresented in genetic studies but have the most genetic variation and face wide-ranging environmental exposures globally. Because systematic evaluations of genetic prediction had not yet been conducted in ancestries that span African diversity, we calculated polygenic risk scores (PRS) in simulations across Africa and in empirical data from South Africa, Uganda, and the UK to better understand the generalizability of genetic studies. PRS accuracy improves with ancestry-matched discovery cohorts more than from ancestry-mismatched studies. Within ancestrally and ethnically diverse South Africans, we find that PRS accuracy is low for all traits but varies across groups. Differences in African ancestries contribute more to variability in PRS accuracy than other large cohort differences considered between individuals in the UK versus Uganda. We computed PRS in African ancestry populations using existing European-only versus ancestrally diverse genetic studies; the increased diversity produced the largest accuracy gains for hemoglobin concentration and white blood cell count, reflecting large-effect ancestry-enriched variants in genes known to influence sickle cell anemia and the allergic response, respectively. Differences in PRS accuracy across African ancestries originating from diverse regions are as large as across out-of-Africa continental ancestries, requiring commensurate nuance.

9

Fine-tuning Polygenic Risk Scores with GWAS Summary Statistics

Zhao, Z.; Yi, Y.; Wu, Y.; Zhong, X.; Lin, Y.; Hohman, T. J.; Fletcher, J.; Lu, Q.

2019-10-18 genetics 10.1101/810713 medRxiv

Top 0.1%

50.9%

Polygenic risk scores (PRSs) have wide applications in human genetics research. Notably, most PRS models include tuning parameters which improve predictive performance when properly selected. However, existing model-tuning methods require individual-level genetic data as the training dataset or as a validation dataset independent from both training and testing samples. These data rarely exist in practice, creating a significant gap between PRS methodology and applications. Here, we introduce PUMAS (Parameter-tuning Using Marginal Association Statistics), a novel method to fine-tune PRS models using summary statistics from genome-wide association studies (GWASs). Through extensive simulations, external validations, and analysis of 65 traits, we demonstrate that PUMAS can perform a variety of model-tuning procedures (e.g. cross-validation) using GWAS summary statistics and can effectively benchmark and optimize PRS models under diverse genetic architecture. On average, PUMAS improves the predictive R² by 205.6% and 62.5% compared to PRSs with arbitrary p-value cutoffs of 0.01 and 1, respectively. Applied to 211 neuroimaging traits and Alzheimer's disease, we show that fine-tuned PRSs will significantly improve statistical power in downstream association analysis. We believe our method resolves a fundamental problem without a current solution and will greatly benefit genetic prediction applications.

10

Investigating the role of common cis-regulatory variants in modifying penetrance of putatively damaging, inherited variants in severe neurodevelopmental disorders

Wigdor, E. M.; Samocha, K. E.; Eberhardt, R. Y.; Chundru, V. K.; Firth, H. V.; Wright, C. F.; Hurles, M. E.; Martin, H. C.

2023-04-25 genetic and genomic medicine 10.1101/2023.04.20.23288860 medRxiv

Top 0.1%

46.8%

Recent work has revealed an important role for rare, incompletely penetrant inherited coding variants in neurodevelopmental disorders (NDDs). Additionally, we have previously shown that common variants contribute to risk for rare NDDs. Here, we investigate whether common variants exert their effects by modifying gene expression, using multi-cis-expression quantitative trait loci (cis-eQTL) prediction models. We first performed a transcriptome-wide association study for NDDs using 6,987 probands from the Deciphering Developmental Disorders (DDD) study and 9,720 controls, and found one gene, RAB2A, that passed multiple testing correction (p = 6.7 × 10⁻⁷). We then investigated whether cis-eQTLs modify the penetrance of putatively damaging, rare coding variants inherited by NDD probands from their unaffected parents in a set of 1,700 trios. We found no evidence that unaffected parents transmitting putatively damaging coding variants had higher genetically-predicted expression of the variant-harboring gene than their child. In probands carrying putatively damaging variants in constrained genes, the genetically-predicted expression of these genes in blood was lower than in controls (p = 2.7 × 10⁻³). However, results for proband-control comparisons were inconsistent across different sets of genes, variant filters and tissues. We find limited evidence that common cis-eQTLs modify penetrance of rare coding variants in a large cohort of NDD probands.

11

The impact of rare pathogenic CNVs is exacerbated by assortative mating.

Cevallos, C.; Auwerx, C.; Hofmeister, R.; Cavinato, T.; Schoeler, T.; Kutalik, Z.; Reymond, A.

2025-09-12 genetic and genomic medicine 10.1101/2025.09.08.25335316 medRxiv

Top 0.1%

46.2%

Copy-number variants (CNVs) are linked to a spectrum of outcomes and carriers of the same variant exhibit variable disease severity. We explored the impact of an individual's polygenic score (PGS) on explaining these differences, focusing on 119 established CNV-trait associations involving 43 clinically-relevant phenotypes. We called CNVs among white British UK Biobank participants, then divided samples into a training set (n = 264,372) to derive independent PGS weights, and a CNV-carrier-enriched test set (n = 96,716) for which PGSs were evaluated. Assessing the individual, joint, and synergistic contribution of CNVs and PGS, we identified a significant additive effect for 45 (38%) CNV-trait pairs but no evidence for interactions. A (spurious) negative correlation between an individual's CNV carrier status and their PGS would be expected under selective participation-induced collider bias. Instead, we observed a widespread positive correlation, which could only be partially accounted for by linkage disequilibrium. Given a non-null inheritance rate for all 17 testable CNVs, we explored whether assortative mating could explain the positive CNV-PGS association. We found strong agreement between this correlation and the one predicted by assortment (r = 0.45, p = 3.9 × 10⁻⁷). Similar trends of positive correlation were observed between PGS and genome-wide burden of CNVs or rare loss-of-function variants. Our results suggest that PGSs contribute to the variable expressivity of CNVs and rare variants, and improve the identification of those at higher risk of clinically relevant comorbidities. We also highlight pervasive assortative mating as a likely mechanism contributing to the compounding of genetic effects across mutational classes.

12

Development and validation of polygenic scores for within-family prediction of disease risks

Moore, S.; Davidson, I.; Anomaly, J.; Li, J. H.; Ahangari, M.; Moissiy, L.; Christensen, M.; Young, A. S.; Stern, D.; Wolfram, T.

2025-08-08 genetic and genomic medicine 10.1101/2025.08.06.25333145 medRxiv

Top 0.1%

44.2%

The clinical implementation of polygenic scores (PGSs) for disease risk prediction, particularly in reproductive health applications, requires rigorous validation. Here, we develop seventeen disease PGSs by conducting large-scale GWAS meta-analyses, and we validate our scores in out-of-sample prediction analyses. We achieve state-of-the-art predictive performance, consistently matching or outperforming academic and commercial benchmarks, with liability R² reaching up to 0.21 (type 2 diabetes). The performance of a PGS for embryo screening depends on its predictive ability within-family, which can be lower than its predictive ability among unrelated individuals. However, very few disease PGSs have been tested within-family. We perform systematic within-family validation of our disease PGSs, finding no decrease in predictive performance within-family for 16 of 17 scores. PGS performance typically declines with genetic distance from training data, an effect that needs to be accounted for to give properly calibrated predictions across ancestries. We perform extensive calibration of our scores' performance across different ancestries, finding improved cross-ancestry performance compared to previous approaches, especially in African and East Asian populations. This is likely because our scores are constructed using a method that incorporates functional genomic annotations on more than 7 million variants, enabling a degree of fine-mapping of causal variants shared across ancestries. We illustrate clinical utility through examining the risk reduction that could be achieved through embryo screening for type 2 diabetes: selecting among 10 embryos is expected to reduce absolute disease risk by 12-20% in families where both parents are affected, with similar relative risk reductions across ancestries. These findings establish a framework for implementing PGS in reproductive medicine while demonstrating both the technology's potential for disease prevention and the methodological standards required for responsible clinical translation.

13

Accounting for Isoform Expression in eQTL Mapping Substantially Increases Power

LaPierre, N.; Pimentel, H.

2023-06-30 genetics 10.1101/2023.06.28.546921 medRxiv

Top 0.1%

43.4%

A core problem in genetics is eQTL mapping, in which genetic variants associated with changes in expression of genes are identified. It is common in eQTL mapping to compute gene expression by aggregating the expression levels of individual isoforms from the same gene and then performing linear regression between SNPs and this aggregated gene expression level. However, SNPs may regulate isoforms from the same gene in different directions due to alternative splicing, or only regulate the expression level of one isoform, causing this approach to lose power. In this study, we provide a systematic evaluation of methods for accounting for individual isoform expression levels based on generative isoform expression heritability models and real data. Over a range of conditions, we show that these approaches substantially increase the power to map eQTLs in both simulations and commonly analyzed large data sets. We identify settings in which different approaches yield an inflated number of false discoveries or lose power. In particular, we show that calling an eGene if there is a significant association between a SNP and any isoform fails to control False Discovery Rate, even when applying standard False Discovery Rate correction. We show that similar trends are observed in real data from the GEUVADIS and GTEx studies, suggesting the possibility that similar effects are present in these consortia.

14

Clinical application of Complete Long Read genome sequencing identifies a 16kb intragenic duplication in EHMT1 in a patient with suspected Kleefstra syndrome

Gorzynski, J. E.; Marwaha, S.; Reuter, C. M.; Jensen, T.; Ferrasse, A.; Raja, A.; Fernandez, L.; Kravets, E.; Carter, J.; Bonner, D.; Sutton, S.; Undiagnosed Diseases Network (UDN), ; Ruzhnikov, M.; Hudgins, L.; Fisher, P. G.; Bernstein, J.; Wheeler, M. T.; Ashley, E. A.

2024-03-29 genetic and genomic medicine 10.1101/2024.03.28.24304304 medRxiv

Top 0.1%

42.3%

Long read sequencing offers benefits for the detection of structural variation in Mendelian disease. Here, we applied a new technology that generates contiguous long reads via tagmentation and sequencing by synthesis to a small cohort of patients with undiagnosed disease from the Undiagnosed Diseases Network. We first compare sequencing from the HG002 benchmark sample from Genome In A Bottle using nanopore sequencing (R10.4.1, duplex reads, Oxford Nanopore), single molecule real time sequencing (Revio SMRT cell, Pacific Biosciences) and complete long read sequencing (S4 flowcell, Novaseq, Illumina). Coverage was 33-35x across platforms. Read length N50 was 6.5kb (ICLR), 16.9kb (SMRT), and 33.8kb (ONT). We noted small differences in single nucleotide variant F1 scores across long read technologies with single nucleotide variant F1 scores (0.985-0.999) exceeding indel scores (0.78-0.99) and structural variant scores (0.74-0.96). We applied CLR sequencing to seven undiagnosed patients. In one patient, we detected and prioritized a novel 16kb intragenic duplication encompassing exons 5 and 6 in EHMT1. Resolution of the breakpoints and examination of flanking sequences revealed that the duplication was present in tandem and was predicted to result in a frameshift of the amino acid sequence and an early termination codon. It resulted in a diagnosis of Kleefstra syndrome. The variant was confirmed with targeted EHMT1 clinical testing and detected via nanopore and SMRT sequencing. In summary, we report the early clinical application of complete long read sequencing to a small cohort of undiagnosed patients.
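
For readers unfamiliar with the read length N50 statistic quoted above, the following is a minimal sketch of the standard definition: the length L such that reads of length at least L together contain at least half of all sequenced bases. The function name `read_length_n50` and the toy read lengths are hypothetical and are not taken from the study.

```python
def read_length_n50(read_lengths):
    """Return the read length N50 for a collection of read lengths.

    Reads are sorted from longest to shortest; the N50 is the length of
    the read at which the running total first reaches half of all bases.
    """
    if not read_lengths:
        raise ValueError("at least one read length is required")
    total_bases = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if 2 * running >= total_bases:
            return length


# Toy example with five reads totalling 20 kb.
print(read_length_n50([2000, 3000, 4000, 5000, 6000]))
```

Because the statistic is weighted by bases rather than by read count, a few very long reads can raise the N50 well above the median read length.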

15

Beyond thresholds: a fully Bayesian framework for quantifying allele count evidence for variant pathogenicity

Konovalov, F. A.

2026-02-10 genetics 10.64898/2026.02.09.704882 medRxiv

Top 0.1%

42.2%

Allele count data from affected individuals and population controls are central to variant interpretation, yet their evidential meaning is often mediated by discrete thresholds and implicit assumptions. This work introduces a fully quantitative Bayesian framework for dominant rare disease genetics in which all allele count evidence is summarized by a single quantity, the Bayes factor, that evaluates the probability of observing the same data under two explicitly defined competing models. Rather than replacing individual ACMG/AMP pathogenicity criteria, the Bayes factor provides a unified measure that naturally incorporates evidence in both the pathogenic and benign directions. The framework accounts for variation in affected cohort size, penetrance, disease prevalence, and assay error rates, allowing these biologically and technically meaningful quantities to be specified directly instead of absorbed into fixed cutoffs. Application to a non-Finnish European population shows that the dependence of the Bayes factor on observed allele counts is strongly shaped by how the affected cohort is defined and by false positive rates in control datasets. Across representative scenarios, Bayes factor values are broadly compatible with established allele count criteria combinations expressed on odds-ratio scales under typical parameterizations, while remaining tunable beyond these defaults.

16

Rare germline disorders implicate long non-coding RNAs disrupted by chromosomal structural rearrangements

Andersen, R. E.; Alkuraya, I. F.; Ajeesh, A.; Sakamoto, T.; Mena, E. L.; Amr, S. S.; Romi, H.; Kenna, M. A.; Robson, C. D.; Wilch, E. S.; Nalbandian, K.; Pina-Aguilar, R.; Walsh, C. A.; Morton, C. C.

2024-06-19 genetic and genomic medicine 10.1101/2024.06.16.24307499 medRxiv

Top 0.1%

41.6%

In recent years, there has been increased focus on exploring the role the non-protein-coding genome plays in Mendelian disorders. One class of particular interest is long non-coding RNAs (lncRNAs), which have recently been implicated in the regulation of diverse molecular processes. However, because lncRNAs do not encode protein, there is uncertainty regarding what constitutes a pathogenic lncRNA variant, and thus annotating such elements is challenging. The Developmental Genome Anatomy Project (DGAP) and similar projects recruit individuals with apparently balanced chromosomal abnormalities (BCAs) that disrupt or dysregulate genes in order to annotate the human genome. We hypothesized that rearrangements disrupting lncRNAs could be the underlying genetic etiology for the phenotypes of a subset of these individuals. Thus, we assessed 279 cases with BCAs and selected 191 cases with simple BCAs (breakpoints at only two genomic locations) for further analysis of lncRNA disruptions. From these, we identified 66 cases in which the chromosomal rearrangements directly disrupt lncRNAs. Strikingly, the lncRNAs MEF2C-AS1 and ENSG00000257522 are each disrupted in two unrelated cases. Furthermore, in 30 cases, no genes of any other class aside from lncRNAs are directly disrupted, consistent with the hypothesis that lncRNA disruptions could underlie the phenotypes of these individuals. To showcase the power of this genomic approach for annotating lncRNAs, here we focus on clinical reports and genetic analysis of two individuals with BCAs and additionally highlight six individuals with likely developmental etiologies due to lncRNA disruptions.

17

Functional Predictors of Causative Cis-Regulatory Mutations in Mendelian Disease

Bengani, H.; Grozeva, D.; Moyon, L.; Bhatia, S.; Louros, S. R.; Hope, J.; Jackson, A.; Prendergast, J.; Owen, L. J.; Naville, M.; Rainger, J.; Grimes, G.; Halachev, M.; Murphy, L. C.; Boskovic, O. S.; Heyningen, V. v.; Kind, P.; Abbott, C. M.; Osterweil, E.; Raymond, L.; Roest Crollius, H.; FitzPatrick, D.

2020-08-03 genetics 10.1101/2020.08.03.232926 medRxiv

Top 0.1%

41.4%

Undiagnosed neurodevelopmental disease is significantly associated with rare variants in cis-regulatory elements (CRE) but demonstrating causality is challenging as target gene consequences may differ from a causative variant affecting the coding region. Here, we address this challenge by applying a procedure to discriminate likely diagnostic regulatory variants from those of neutral/low-penetrant effect. We identified six rare CRE variants using targeted and whole genome sequencing in 48 unrelated males with apparent X-linked intellectual disability (XLID) but without detectable coding region variants. These variants segregated appropriately in families and altered conserved bases in predicted CRE targeting known XLID genes. Three were unique and three were rare but too common to be plausibly causative for XLID. We compared the cis-regulatory activity of wild-type and mutant alleles in zebrafish embryos using dual-color fluorescent reporters. Two variants showed striking changes: one plausibly causative (FMR1CRE) and the other likely neutral/low-penetrant (TENM1CRE). These variants were "knocked-in" to mice and both altered embryonic neural expression of their target gene. Only Fmr1CRE mice showed disease-relevant behavioral defects. FMR1CRE is plausibly disease-associated resulting in complex misregulation of Fmr1/FMRP rather than loss-of-function. This is consistent both with absence of Fragile X syndrome in the probands and the observed electrophysiological anomalies in the FMR1CRE mouse brain. Although disruption of in vivo patterns of endogenous gene expression in disease-relevant tissues by CRE variants cannot be used as strong evidence for Mendelian disease association, in conjunction with extreme rarity in human populations and with relevant knock-in mouse phenotypes, such variants can become likely pathogenic.

18

ImputePGTA: accurate embryo genotyping and polygenic scoring from ultra-low-pass sequencing

Li, J. H.; Wolfram, T.; Davidson, I.; Schleede, J.; Swift, J.; Moore, S.; Stern, D.; Christensen, M.; Young, A. I.

2025-11-09 genetic and genomic medicine 10.1101/2025.11.07.25339763 medRxiv

Top 0.1%

41.3%

Preimplantation genetic testing (PGT) for polygenic risk (PGT-P) holds great promise for reducing lifetime disease burden, but genotyping embryos remains difficult. PGT for aneuploidy (PGT-A) is a routine test used in over half of in vitro fertilization cycles in the United States, typically via ultra-low-pass (ULP) sequencing (~0.004x) or, less commonly, genotyping arrays. Here we describe an approach that enables accurate embryo genotyping from PGT-A data when combined with estimated parental haplotypes. We develop a Coupled Hidden Markov Model, ImputePGTA, which jointly infers inheritance patterns from parents to offspring as well as phasing errors in parental haplotypes, along with an inference algorithm that scales linearly with the number of embryos. The performance of our approach depends on the phasing of parental haplotypes, which we improve through a method, phaseGrafter, that combines evidence from short and long reads, further enabling imputation of rare variants. We validate our approach through simulations and comparison of embryo genomes reconstructed from real PGT-A data to post-birth whole genome sequencing data. When using long reads for parental phasing, we achieve a dosage correlation of 0.98 with high-quality post-birth genotypes, and a mean absolute difference of 0.11 standard deviations across 17 disease polygenic scores, lower than from imputation of genotyping array data from reference panels. Uncertainty from imputation from ULP PGT-A data with accurate parental phasing results in only a ~2% attenuation in expected gains from embryo selection for typical embryo cohort sizes. Our approach removes an important technological barrier to using PGT-P and is already facilitating more widespread adoption.

19

A framework evaluating the utility of multi-gene, multi-disease population-based panel testing that accounts for uncertainty in penetrance estimates

Liang, J. W.; Christensen, K. D.; Green, R. C.; Kraft, P.

2023-01-04 genetics 10.1101/2022.08.10.503415 medRxiv

Top 0.1%

41.3%

Panel germline testing allows for efficient detection of deleterious variants for multiple conditions, but the benefits and harms of identifying these variants are not always well understood. We present a multi-gene, multi-disease aggregate utility formula that allows the user to consider adding or removing each gene in a panel based on variant frequency; estimated penetrances; and subjective disutilities for testing positive but not developing the disease and testing negative but developing the disease. We provide credible intervals for utility that reflect uncertainty in penetrance estimates. Rare, highly penetrant deleterious variants tend to contribute positive net utilities for a wide variety of user-specified disutilities, even when accounting for parameter estimation uncertainty. However, the clinical utility of deleterious variants with moderate, uncertain penetrance depends more on assumed disutilities. The decision to include a gene on a panel depends on variant frequency, penetrance, and subjective utilities, and should account for uncertainties around these factors.

20

Multi-ancestry polygenic risk scores using phylogenetic regularization

Layne, E.; Zabad, S.; Li, Y.; Blanchette, M.

2024-02-17 bioinformatics 10.1101/2024.02.14.580313 medRxiv

Top 0.1%

41.1%

Accurately predicting phenotype using genotype across diverse ancestry groups remains a significant challenge in human genetics. Many state-of-the-art polygenic risk score models are known to have difficulty generalizing to genetic ancestries that are not well represented in their training set. To address this issue, we present a novel machine learning method for fitting genetic effect sizes across multiple ancestry groups simultaneously, while leveraging prior knowledge of the evolutionary relationships among them. We introduce DendroPRS, a machine learning model where SNP effect sizes are allowed to evolve along the branches of the phylogenetic tree capturing the relationship among populations. DendroPRS outperforms existing approaches at two important genotype-to-phenotype prediction tasks: expression QTL analysis and polygenic risk scores. We also demonstrate that our method can be useful for multiancestry modelling, both by fitting population-specific effect sizes and by more accurately accounting for covariate effects across groups. We additionally find a subset of genes where there is strong evidence that an ancestry-specific approach improves eQTL modelling.
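
One way to picture effect sizes that "evolve along the branches" of a population tree is a penalty that charges for each change between a node and its parent. The sketch below is a generic illustration of that idea and is not taken from DendroPRS: the function name `phylo_penalty`, the squared-difference form, and the division by branch length (a Brownian-motion-style assumption) are all hypothetical choices.

```python
def phylo_penalty(parent, effects, branch_length):
    """Sum over tree edges of squared effect-size change per unit branch length.

    parent:        maps each node to its parent (None for the root)
    effects:       maps each node to a list of per-SNP effect sizes
    branch_length: maps each non-root node to the length of the edge above it
    """
    total = 0.0
    for node, par in parent.items():
        if par is None:
            continue  # the root has no incoming edge to penalize
        diff_sq = sum((c - p) ** 2 for c, p in zip(effects[node], effects[par]))
        total += diff_sq / branch_length[node]
    return total


# Toy tree: one ancestral node with two descendant populations, two SNPs.
parent = {"root": None, "pop_a": "root", "pop_b": "root"}
effects = {
    "root": [0.1, 0.2],
    "pop_a": [0.1, 0.2],  # identical to the root: contributes nothing
    "pop_b": [0.3, 0.2],  # differs at the first SNP only
}
branch_length = {"pop_a": 1.0, "pop_b": 2.0}

penalty = phylo_penalty(parent, effects, branch_length)
```

Added to a prediction loss, a term of this kind pulls closely related populations toward shared effect sizes while leaving room for population-specific deviations on longer branches.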